Manual and semi-automatic normalization of historical spelling - case studies from Early New High German

Authors

  • Marcel Bollmann
  • Stefanie Dipper
  • Julia Krasselt
  • Florian Petran
Abstract

This paper presents work on manual and semi-automatic normalization of historical language data. We first address the guidelines that we use for mapping historical to modern word forms. The guidelines distinguish between normalization (preferring forms close to the original) and modernization (preferring forms close to modern language). Average inter-annotator agreement is 88.38% on a set of data from Early New High German. We then present Norma, a semi-automatic normalization tool. It integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way. The tool dynamically updates the set of rule entries, given new input. Depending on the text and training settings, normalizing 1,000 tokens results in overall accuracies of 61.78–79.65% (baseline: 24.76–59.53%).
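The modular, interactive design described in the abstract can be illustrated with a minimal sketch. This is not Norma's actual code or API: the class `ChainNormalizer`, its methods, and the example rules are hypothetical, and the update step here only stores confirmed pairs in the lexicon, which is a simplification of the dynamic rule updating the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class ChainNormalizer:
    """Toy normalizer chain (hypothetical): lexicon lookup first, then rewrite rules."""

    lexicon: dict[str, str] = field(default_factory=dict)
    rules: list[tuple[str, str]] = field(default_factory=list)

    def normalize(self, word: str) -> tuple[str, str]:
        """Return (suggestion, name of the module that produced it)."""
        # Module 1: exact lookup of a previously seen historical form.
        if word in self.lexicon:
            return self.lexicon[word], "lexicon"
        # Module 2: apply every character rewrite rule in order.
        candidate = word
        for old, new in self.rules:
            candidate = candidate.replace(old, new)
        if candidate != word:
            return candidate, "rules"
        # Fallback: leave the word unchanged.
        return word, "identity"

    def confirm(self, word: str, normalized: str) -> None:
        """Store a user-confirmed pair so later lookups can reuse it."""
        self.lexicon[word] = normalized


# Illustrative rules only; they are assumptions, not taken from the paper.
norm = ChainNormalizer(rules=[("vn", "un"), ("th", "t")])
suggestion, source = norm.normalize("vnd")

# Interactive step: the annotator supplies a form the rules cannot produce.
norm.confirm("jn", "in")
```

Each suggestion carries the name of the module that produced it, so an annotator can see whether a form came from the lexicon or from a rule before accepting or correcting it.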

Similar resources

The Identification of Spelling Variants in English and German Historical Texts: Manual or Automatic?

Dawn Archer (University of Central Lancashire); Andrea Ernst-Gerlach, Sebastian Kempken, Thomas Pilz (Universität Duisburg-Essen); Paul Rayson (Lancaster University)

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previou...

Normalizing Medieval German Texts: from rules to deep learning

The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....

Spelling Normalization of Historical German with Sparse Training Data

Recently, there has been a growing interest in historical language corpora. Projects to create such corpora exist for a variety of languages such as German (Scheible et al. 2011), Spanish (SánchezMarco et al. 2010), or Slovene (Erjavec 2012). Annotation of these corpora is complicated by the fact that specialized tools for these language stages are typically not available. A common approach is ...

Information Access to Historical Documents from the Early New High German Period

With the new interest in historical documents insight grew that electronic access to these texts causes many specific problems. In the first part of the paper we survey the present role of digital historical documents. After collecting central facts and observations on historical language change we comment on the difficulties that result for retrieval and data mining on historical texts. In the...

Publication date: 2012